Model Presentation

Credit Risk Modelling Challenge

Olumide Oyalola

6/29/23

Credit risk modelling: let’s dive in!

👇

Problem Identification

This is a supervised binary classification problem: given an applicant’s attributes, predict whether or not the loan will default.
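As a first sanity check (assuming the cleaned data live in modifiedCredit_tbl, the tibble partitioned later on, with a two-level factor default as the outcome), the class balance can be inspected up front:

```r
library(dplyr)

# Class balance of the outcome; the counts shown will depend on the data
modifiedCredit_tbl %>% 
  count(default) %>% 
  mutate(prop = n / sum(n))
```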

Exploratory Data Analysis

Historical loan default status

Applicant gender distribution

Applicant job category

Proportion of credit history

Loan purpose

Property type

Housing Type

Applicant Personal Status

Modeling

Data Partitioning

For the analyses, we start by holding back a testing set with initial_split(). The remaining data are split into training and validation sets:

Code
set.seed(1601)

credit_split <- initial_split(modifiedCredit_tbl,
                            prop = 0.75,
                            strata = default)
crd_train <- credit_split %>% 
  training()
crd_test <- credit_split %>% 
  testing()

set.seed(1602)

crd_val <- validation_split(crd_train, strata = default, prop = 4/5)

crd_val$splits[[1]]
<Training/Validation/Total>
<600/150/750>
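Because strata = default was used, the default proportions should be roughly equal across the partitions; a quick check (a sketch, assuming dplyr is loaded):

```r
# Compare outcome proportions in the training and testing partitions
bind_rows(
  crd_train %>% count(default) %>% mutate(set = "train"),
  crd_test  %>% count(default) %>% mutate(set = "test")
) %>% 
  group_by(set) %>% 
  mutate(prop = n / sum(n))
```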

Recipes in the wild

Code
crd_rec <- recipe(default ~ ., 
                  data = analysis(crd_val$splits[[1]])) %>% 

# Now add preprocessing steps to the recipe:

  step_impute_knn(all_predictors()) %>%
  step_zv(all_numeric_predictors()) %>% 
  step_orderNorm(all_numeric_predictors()) %>% 
  step_normalize(all_numeric_predictors()) %>%
  step_spatialsign(all_numeric_predictors()) %>% 
  step_dummy(all_nominal_predictors()) %>%
  step_other(all_nominal_predictors()) %>% 
  step_filter_missing(all_nominal_predictors(), threshold = 0) 


crd_rec_trained <- 
  crd_rec %>% 
  prep(log_changes = TRUE)
step_impute_knn (impute_knn_xog4f): same number of columns

step_zv (zv_x4j7U): same number of columns

step_orderNorm (orderNorm_ASE8C): same number of columns

step_normalize (normalize_UdsR8): same number of columns

step_spatialsign (spatialsign_1scU2): same number of columns

step_dummy (dummy_yWhfj): 
 new (29): credit_history_delayed, credit_history_fully.repaid, ...
 removed (10): credit_history, purpose, personal_status, other_debtors, ...

step_other (other_tAQXZ): same number of columns

step_filter_missing (filter_missing_29F6e): same number of columns
Code
crd_rec_trained

Below are the histograms of the amount predictor before and after the recipe was prepared:
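One way to draw those histograms (a sketch; the patchwork package for the side-by-side layout is an assumption, and bake(new_data = NULL) returns the analysis set the recipe was prepped on):

```r
library(ggplot2)
library(patchwork)  # assumed here, only for side-by-side layout

before <- analysis(crd_val$splits[[1]]) %>% 
  ggplot(aes(x = amount)) + 
  geom_histogram(bins = 30) + 
  ggtitle("amount (raw)")

after <- bake(crd_rec_trained, new_data = NULL) %>% 
  ggplot(aes(x = amount)) + 
  geom_histogram(bins = 30) + 
  ggtitle("amount (preprocessed)")

before + after
```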

Feature Extraction

  • Principal Component Analysis
  • Partial Least Squares
  • Independent Component Analysis
  • Uniform Manifold Approximation and Projection
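Each of these can be appended to the trained base recipe. For example, a supervised PLS step might look like the following (a sketch: step_pls() is a real recipes step but requires the mixOmics package, and pls_rec is a hypothetical name):

```r
# Supervised feature extraction: project the predictors onto components
# that are maximally related to the outcome (needs mixOmics installed)
pls_rec <- 
  crd_rec %>% 
  step_pls(all_numeric_predictors(), outcome = "default", num_comp = tune())
```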

Principal Component Analysis

Partial Least Squares

Independent Component Analysis

Uniform Manifold Approximation and Projection

UMAP is similar to the popular t-SNE method for nonlinear dimension reduction.
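A comparable UMAP step is available in the embed package (a sketch; umap_rec is a hypothetical name):

```r
library(embed)

# Supervised UMAP embedding of the predictors
umap_rec <- 
  crd_rec %>% 
  step_umap(all_numeric_predictors(), outcome = vars(default),
            num_comp = tune(), neighbors = tune())
```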

Model specification

Both the PLS and UMAP methods are worth investigating in conjunction with different models.

Code
# single-layer neural network

mlp_spec <- 
  mlp(hidden_units = tune(),
      penalty = tune(),
      epochs = tune()) %>% 
  set_engine("nnet") %>% 
  set_mode("classification")

# bagged trees

bagging_spec <- 
  bag_tree(cost_complexity = tune(),
           tree_depth = tune(), 
           min_n = tune(), 
           class_cost = tune()) %>% 
  set_engine("rpart") %>% 
  set_mode("classification")
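The tuning object crd_res used below is not shown in the deck; one plausible way it could have been assembled is a workflow set crossing the feature-extraction recipes with the model specifications (a sketch under that assumption; pls_rec and umap_rec are hypothetical recipe names):

```r
library(workflowsets)

crd_res <- 
  workflow_set(
    preproc = list(pls = pls_rec, umap = umap_rec),  # hypothetical recipes
    models  = list(mlp = mlp_spec, bagging = bagging_spec)
  ) %>% 
  # Tune every recipe/model combination on the validation split
  workflow_map("tune_grid",
               resamples = crd_val,
               grid = 10,
               metrics = metric_set(roc_auc),
               seed = 1603)
```

The workflow IDs produced this way (e.g. pls_mlp) match the ones extracted later for the final model.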

Model ranking

Code
rankings <- 
  rank_results(crd_res, select_best = TRUE) %>% 
  mutate(method = map_chr(wflow_id, ~str_split(.x, "_", simplify = TRUE)[1]))

tidymodels_prefer()

rankings %>% dplyr::select(rank, mean, model, method) %>% datatable()

Final model

Code
best_res <- 
  crd_res %>% 
  extract_workflow("pls_mlp") %>% 
  finalize_workflow(
    crd_res %>% 
      extract_workflow_set_result("pls_mlp") %>% 
      select_best(metric = "roc_auc")
  ) %>% 
  last_fit(split = credit_split, metrics = metric_set(roc_auc))

best_wflow_fit <- best_res$.workflow[[1]]

extract_fit_parsnip(best_wflow_fit)
parsnip model object

a 3-1-1 network with 6 weights
inputs: PLS1 PLS2 PLS3 
output(s): ..y 
options were - entropy fitting  decay=0.00000000024

What’s the model performance on test data?

Code
collect_metrics(best_res)
# A tibble: 1 x 4
  .metric .estimator .estimate .config             
  <chr>   <chr>          <dbl> <chr>               
1 roc_auc binary         0.804 Preprocessor1_Model1

ROC Curve and AUC Estimate
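The test-set ROC curve can be drawn from the collected predictions (a sketch; the .pred_yes column name assumes the positive level of default is "yes"):

```r
collect_predictions(best_res) %>% 
  roc_curve(truth = default, .pred_yes) %>%  # .pred_yes is an assumption
  autoplot()
```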

What about the confusion matrix?

What about the confusion matrix plot?
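Both the confusion matrix and its plot can come from the same collected predictions (a sketch):

```r
crd_cm <- 
  collect_predictions(best_res) %>% 
  conf_mat(truth = default, estimate = .pred_class)

crd_cm                              # the confusion matrix table
autoplot(crd_cm, type = "heatmap")  # its plot
```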

Variable Importance

Below is a plot of the relative importance of the features in the final model.

Relative variable importance

Global explainer for the classification ML tidymodel on the credit data
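One common way to build such a global explainer for a tidymodels workflow is via DALEXtra (a sketch, assuming the positive level of default is "yes"):

```r
library(DALEXtra)

explainer <- explain_tidymodels(
  best_wflow_fit,
  data  = crd_train %>% dplyr::select(-default),
  y     = as.integer(crd_train$default == "yes"),  # "yes" level is an assumption
  label = "pls_mlp"
)

# Permutation-based variable importance, then plot it
set.seed(1803)
model_parts(explainer) %>% plot()
```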

What’s the takeaway here?

From the relative variable importance plot, checking_balance, months_loan_duration and credit_history are the top three features in the final model, whereas amount is the least important of the selected features.